SPARK-3568 [mllib] add ranking metrics #2667
Conversation
QA tests have started for PR 2667 at commit
QA tests have finished for PR 2667 at commit
Test PASSed.
* ::Experimental::
* Evaluator for ranking algorithms.
*
* @param predictionAndLabels an RDD of (predicted ranking, ground truth set) pairs.
The inputs are really ranks, right? Would this not be more natural as Int then?
I might have expected that the inputs were instead predicted and ground truth "scores", in which case Double makes sense. But then the methods would need to convert the scores to rankings.
@srowen Thanks for all the comments! The current implementation only considers binary relevance, meaning the input is just labels instead of scores. It is true that Int is enough. IMHO, using Double is compatible with the current MLlib setting and could be extended to deal with the score setting.
Hm, are these IDs of some kind then? The example makes them look like rankings, but maybe that's a coincidence. They shouldn't be labels, right? Because the resulting set would almost always be {0.0, 1.0}.
@srowen Take the first test case for example: (Array(1, 6, 2, 7, 8, 3, 9, 10, 4, 5), Array(1 to 5)). For this single query, the ideal ranking algorithm should return Document 1 to Document 5, but the ranking algorithm actually returns 10 documents, with IDs (1, 6, ...). This setting right now only works for binary relevance.
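For concreteness, here is a minimal sketch of how such a (predicted ranking, ground truth set) pair could be expressed as the evaluator's input. It assumes a SparkContext `sc` (e.g. in spark-shell) and simply mirrors the test case quoted above; it is an illustration, not code from this PR.

```scala
// Assumes a SparkContext `sc`, e.g. in spark-shell.
// One query: the ranker returned documents 1, 6, 2, 7, 8, 3, 9, 10, 4, 5 (in that order),
// while the ground truth says documents 1 through 5 are the relevant ones.
val predictionAndLabels = sc.parallelize(Seq(
  (Array(1.0, 6.0, 2.0, 7.0, 8.0, 3.0, 9.0, 10.0, 4.0, 5.0),
   Array(1.0, 2.0, 3.0, 4.0, 5.0))
))
// Only membership in the ground truth set matters (binary relevance);
// the Double values here are document IDs, not relevance scores.
```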
Yes, I get it. But these could just as easily be Strings, maybe? Like document IDs? Anything you can put into a set and look for. Could this even be generified to accept any type? At least, Double seemed like the least likely type for an ID. It's not even what MLlib overloads to mean "ID"; for example, ALS assumes Int IDs.
+1 on @srowen's suggestion. We can use a generic type here with a ClassTag; see the sketch below.
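A rough sketch of what the generified signature might look like, assuming the class keeps the RDD-of-Array input discussed above (hypothetical, not the code currently in this PR):

```scala
import scala.reflect.ClassTag

import org.apache.spark.rdd.RDD

// Hypothetical sketch: the item type T is generic (Int, Long, String, ... document IDs),
// and a ClassTag context bound is needed because the items are stored in Arrays.
class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])]) {
  // Scalar metrics such as precisionAt(k), ndcgAt(k) and meanAveragePrecision would live here.
}
```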
... and if the code is just going to turn the arguments into a Set, then it need not be an RDD of Array, but something more generic?
@srowen We can optimize the implementation later, if it is worth doing, and using Array may help. Usually, both predictions and labels are small, so scanning them sequentially may be faster than toSeq and contains. Array is also consistent across Java, Scala, and Python, and storage-efficient for primitive types.
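As an illustration of the sequential-scan idea, here is a hypothetical helper (not code from this PR): the small ground truth Array is turned into a Set once, and the prediction Array is walked once up to position k.

```scala
// Hypothetical helper: count how many of the top-k predicted items are relevant.
def countHitsAtK[T](pred: Array[T], lab: Array[T], k: Int): Int = {
  val labSet = lab.toSet          // ground truth is small, so this is cheap
  val n = math.min(pred.length, k)
  var hits = 0
  var i = 0
  while (i < n) {                 // single sequential pass over the predictions
    if (labSet.contains(pred(i))) hits += 1
    i += 1
  }
  hits
}

// e.g. countHitsAtK(Array(1, 6, 2, 7, 8), Array(1, 2, 3, 4, 5), 5) == 2
```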
QA tests have started for PR 2667 at commit
QA tests have finished for PR 2667 at commit
Test PASSed.
/**
 * Returns the precsion@k for each query
 */
lazy val precAtK: RDD[Array[Double]] = predictionAndLabels.map {case (pred, lab)=>
Returning an RDD may not be useful in evaluation. Usually people look for scalar metrics. prec@k and ndcg@k on a test dataset usually mean the average, not the individual per-query values.
Should it be private? Think about what users call it for. Even if we make it public, users may still need some aggregate metrics out of it. For the first version, I think it is safe to provide only the following (see the usage sketch after this comment):
- precisionAt(k: Int): Double
- ndcgAt(k: Int): Double
- meanAveragePrecision

Also, `{case (pred, lab)=>` -> `{ case (pred, lab) =>` (spaces after `{` and `)`).
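For illustration, a hedged sketch of how that scalar API might be used, assuming the evaluator ends up exposed as org.apache.spark.mllib.evaluation.RankingMetrics and a SparkContext `sc` is available (e.g. in spark-shell); the data and values are illustrative only.

```scala
import org.apache.spark.mllib.evaluation.RankingMetrics

// Two queries, each a (predicted ranking, ground truth set) pair.
val predictionAndLabels = sc.parallelize(Seq(
  (Array(1.0, 6.0, 2.0, 7.0, 8.0, 3.0, 9.0, 10.0, 4.0, 5.0), Array(1.0, 2.0, 3.0, 4.0, 5.0)),
  (Array(4.0, 1.0, 5.0, 6.0, 2.0, 7.0, 3.0, 8.0, 9.0, 10.0), Array(1.0, 2.0, 3.0))
))

val metrics = new RankingMetrics(predictionAndLabels)
val p5  = metrics.precisionAt(5)        // precision truncated at position 5, averaged over queries
val n5  = metrics.ndcgAt(5)             // NDCG truncated at position 5, averaged over queries
val map = metrics.meanAveragePrecision  // mean average precision over all queries
```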
btw, I think it is better to use precisionAt instead of precision to emphasize that the parameter is a position. We use precision(class) in MulticlassMetrics, and users may get confused if we use the same name.
@coderxiang
@mengxr @coderxiang Hm, come to think of it, doesn't this duplicate a lot of the purpose of https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/evaluation/MulticlassMetrics.scala? It also computes precision and recall, and supports multi-class classification instead of the binary case only. It feels like these two should be rationalized and not left as two unrelated implementations.
@srowen Ranking metrics are different from multiclass metrics. In general, multiclass metrics do not consider the ordering of the predictions, just hits and misses, and they don't truncate the result. Ranking metrics consider the ordering of predictions, applying discounts and truncating the result. I think they are two separate sets of evaluation metrics.
@mengxr Yes, I understand these metrics. Precision / recall are binary classifier metrics at heart (but not nDCG, for example), while precision@k needs ranking; that's why precision/recall turn up in both. I understand it is two views on the same metric, and these are collections of different metrics. Hm, maybe this makes more sense if the metrics here are clearly precision@k and recall@k, with an argument k. The other class has F-measure, true positive rate, etc. I suppose those could be implemented here too for consistency, but I don't know how useful they are. Meh, anything to rationalize the purposeful difference between these two would help. Maybe it's just me.
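To make the ordering point concrete, here is a hedged sketch (using the same assumed RankingMetrics API and SparkContext `sc` as above): the two queries retrieve exactly the same set of documents, so a pure hit/miss view sees no difference, yet precision@1 differs because the ordering differs.

```scala
import org.apache.spark.mllib.evaluation.RankingMetrics

// Same retrieved set {1.0, 2.0, 6.0} and same ground truth {1.0, 2.0}, different orderings.
val ordered   = sc.parallelize(Seq((Array(1.0, 2.0, 6.0), Array(1.0, 2.0))))
val reordered = sc.parallelize(Seq((Array(6.0, 1.0, 2.0), Array(1.0, 2.0))))

new RankingMetrics(ordered).precisionAt(1)    // 1.0 -- the top-ranked document is relevant
new RankingMetrics(reordered).precisionAt(1)  // 0.0 -- the top-ranked document is not relevant
```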
@srowen Making
Can one of the admins verify this patch?
@mengxr @srowen Sorry for the late response. In the update, I added two new members that support computing
QA tests have started for PR 2667 at commit
QA tests have finished for PR 2667 at commit
Test PASSed.
import org.apache.spark.rdd.RDD


remove extra empty lines
QA tests have started for PR 2667 at commit
QA tests have finished for PR 2667 at commit
Test PASSed.
QA tests have finished for PR 2667 at commit
Test PASSed.
* Compute the average precision of all the queries, truncated at ranking position k.
*
* If for a query, the ranking algorithm returns n (n < k) results, the precision value will be
* computed as #(relevant items retrived) / k. This formula also applies when the size of the
retrived -> retrieved
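As a worked example of the truncation rule above: if k = 5 but a query returns only the 3 documents (1, 6, 2) against ground truth {1, 2, 3, 4, 5}, then 2 relevant documents are retrieved and precision@5 is 2 / 5 = 0.4, not 2 / 3; the denominator stays at k even though fewer than k results were returned.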
QA tests have started for PR 2667 at commit
Test FAILed.
QA tests have finished for PR 2667 at commit
Test PASSed.
QA tests have started for PR 2667 at commit
QA tests have finished for PR 2667 at commit
Test PASSed.
LGTM. Merged into master. Thanks!
Add common metrics for ranking algorithms (http://www-nlp.stanford.edu/IR-book/), including precision@k, NDCG@k, and mean average precision. The following methods and the corresponding tests are implemented:
- precisionAt(k: Int): Double
- ndcgAt(k: Int): Double
- meanAveragePrecision
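As a worked check against the first test case discussed above (predictions (1, 6, 2, 7, 8, 3, 9, 10, 4, 5), ground truth {1, ..., 5}), and assuming the standard definition of average precision with the number of relevant documents as the denominator: the relevant documents are hit at ranks 1, 3, 6, 9 and 10, so the average precision for that query is (1/1 + 2/3 + 3/6 + 4/9 + 5/10) / 5 ≈ 0.622, and meanAveragePrecision is this quantity averaged over all queries.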